`epi_df` argument refactoring #460

dsweber2 · 2024-06-07T17:03:21Z

Checklist

Please:

- Make sure this PR is against "dev", not "main" (unless this is a release PR).
- Request a review from one of the current main reviewers: brookslogan, nmdefries.
- Make sure to bump the version number in DESCRIPTION. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- Describe changes made in NEWS.md, making sure breaking changes (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 1.7.2, then write your changes under the 1.8 heading).
- See DEVELOPMENT.md for more information on the development process.

Change explanations for reviewer

There are a few things we want to change about the arguments to epi_df and/or as_epi_df:

- columns that aren't named exactly time_value but are unambiguously meant to be that should be interpreted as such, e.g. date.
- columns that aren't named exactly geo_value but are unambiguously meant to be that should be interpreted as such, e.g. geo_id.
- use tidyselect to handle renaming (so if you had both forecast_date and target_date, you could pass time_value = target_date to disambiguate)
- break additional metadata into "additional" and other_keys=, since other_keys is by far the most useful part of it
- update the docs to reflect these changes
- do all of the above for epi_archives as well. Some of this has already happened (e.g. other_keys is already separated)
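
For illustration, the renaming behavior described in this list can be sketched with plain dplyr. This is a minimal sketch under assumptions, not the package's implementation: `substitutions` is a shortened, hypothetical lookup, and `cases` and `forecasts` are made-up data frames.

```r
library(dplyr)

# Hypothetical, shortened lookup: names are the target column,
# values are the column names accepted as equivalents.
substitutions <- c(time_value = "date", time_value = "target_date")

cases <- data.frame(
  geo_value = c("ca", "ca"),
  date = as.Date(c("2024-06-01", "2024-06-02")),
  value = c(1, 2)
)

# Only `date` is present, so it is renamed to `time_value`; `any_of()`
# silently skips candidates that are absent from the data.
renamed <- cases %>% rename(any_of(substitutions))
names(renamed)

# With two date-like columns, an explicit tidyselect-style rename picks one.
forecasts <- data.frame(
  geo_value = "ca",
  forecast_date = as.Date("2024-06-01"),
  target_date = as.Date("2024-06-08"),
  value = 3
)
disambiguated <- forecasts %>% rename(time_value = target_date)
names(disambiguated)
```

The explicit `rename(time_value = target_date)` only stands in for the proposed `time_value = target_date` argument; how that argument is spelled in the constructor is not shown here.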

Current list of time_value equivalents:

c(
      time_value = "date",
      time_value = "time",
      time_value = "datetime",
      time_value = "dateTime",
      time_value = "date_time",
      time_value = "forecast_date",
      time_value = "target_date",
      time_value = "week",
      time_value = "day",
      time_value = "epiweek",
      time_value = "month",
      time_value = "year",
      time_value = "yearmon",
      time_value = "yearMon",
      time_value = "dates",
      time_value = "time_values",
      time_value = "forecast_dates",
      time_value = "target_dates"
)

Current list of geo_value equivalents:

c(
      geo_value = "geo_values",
      geo_value = "geo_id",
      geo_value = "geos",
      geo_value = "fips",
      geo_value = "zip",
      geo_value = "county",
      geo_value = "hrr",
      geo_value = "msa",
      geo_value = "state",
      geo_value = "province",
      geo_value = "nation",
      geo_value = "states",
      geo_value = "provinces",
      geo_value = "counties"
)

Current list of version equivalents:

c(
      version = "issue",
      version = "release"
)
And for all of these, the capitalized Snake_Case versions are accepted as well.
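
How the capitalized variants are generated is not shown in this thread. One way to derive them from the lists above, as a sketch only: `capitalize_snake()` is a hypothetical helper, and the exact capitalization rule (first letter of each underscore-separated word) is an assumption.

```r
# Hypothetical helper: capitalize the first letter of each
# underscore-separated word, e.g. "geo_id" becomes "Geo_Id".
capitalize_snake <- function(x) {
  gsub("(^|_)([a-z])", "\\1\\U\\2", x, perl = TRUE)
}

# Shortened version of the geo_value list above.
geo_candidates <- c(geo_value = "geo_id", geo_value = "fips")

# Append the capitalized variants, keeping the same target name.
with_caps <- c(
  geo_candidates,
  setNames(capitalize_snake(geo_candidates), names(geo_candidates))
)
with_caps
```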

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Resolves "Promote" other_keys to be printed, a constructor parameter, and more clearly documented #186, though some features there are out of scope; I'm going to make separate issues for:
- print.epi_df should print the other_keys metadata
- keys documentation
- possibly "regenerate the saved data sets"
Resolves [enh] promote other_keys #446
Resolves as_epi_df construction conviences #456

Future extensions

combining multiple columns into unique keys, e.g. time_value = join_by(month, year) where month and year are separate columns

brookslogan · 2024-06-07T18:09:05Z

Nice lists; some notes/opinions:

- I don't think time_value should be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.
- location should be a possibility for geo_value (I think the Hub uses this), and maybe jurisdiction.
- time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic: e.g., week or epiweek could be in YYYYww format, or in ww format and rely on a separate year column. It's also ambiguous from the name alone which week numbering system is being used (even "epiweek" can differ between the US and other nations).

nmdefries · 2024-06-07T19:13:55Z

time_value = time, datetime

dsweber2 · 2024-06-07T19:45:48Z

> I don't think time_value should be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.

So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?

> location should be a possibility for geo_value (think the Hub uses this), and maybe jurisdiction.

Oh yeah, that makes sense, thanks!

> time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic as e.g., week or epiweek could be YYYYww format or ww format and rely on a separate year column, plus it's ambiguous what type of week numbering system is being used simply from the name (even "epiweek" can differ between the US and other nations).

I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

brookslogan · 2024-06-10T17:58:18Z

> So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?

In that instance, yes, although I'd also be fine with guessing it to be the target_date.

But suppose there is only a forecast_date (no target_date); I don't think we should guess it to be the time_value.

> I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

Sorry, this was mostly me reasoning about why it's good to exclude these. You could maybe accept yearmon & yearmonth, and perhaps year if it appears by itself without month or mon or week or any other possible pairings you could think of. But the rest are too ambiguous by name. If you detect an appropriate class from tsibble (e.g., whatever tsibble::yearweek outputs) then that is less fraught --- tsibble does disambiguate these. But the simplest "solution" is just to exclude all of these possibilities and require the user to specify.
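
The class-based check mentioned here could look roughly like the following. This is a sketch under assumptions: `has_unambiguous_time_class()` is a hypothetical helper, the class names "yearweek" and "yearmonth" are assumed to be what tsibble::yearweek() and tsibble::yearmonth() attach to their output, and including Date and POSIXct is an extra assumption not taken from the comment.

```r
# Hypothetical helper: TRUE when a column's class already says what kind of
# time index it holds, so nothing has to be guessed from the column name.
# "yearweek" and "yearmonth" are assumed tsibble class names.
has_unambiguous_time_class <- function(col) {
  inherits(col, c("Date", "POSIXct", "yearweek", "yearmonth"))
}

has_unambiguous_time_class(as.Date("2024-06-10")) # a Date column
has_unambiguous_time_class(202424L)               # YYYYww or something else?
has_unambiguous_time_class("2024-W24")            # plain character
```

A bare integer or character week column fails the check, which matches the point that such columns are too ambiguous to infer by name.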

dajmcdon · 2024-06-19T18:56:51Z

On forecast_date / target_date, I actually think we might want it to match forecast_date. For example, {hubUtils} would only contain a forecast_date in their model output: https://hubverse-org.github.io/hubData/reference/as_model_out_tbl.html

dsweber2 · 2024-06-25T00:09:34Z

Looking into actually implementing the other_keys change, I/someone should do that in a separate PR, because it's sufficiently wide-ranging w/docs/vignette updates that it should be separated out.

I think this is ready if someone wants to do a review. @lcbrooks I think most of the concerning cases you're thinking about will be taken care of by multiple names matching, which forces an error. Seems like there are legit use cases for forecast_date, so if it's the only date-like column, we should be using it.

brookslogan · 2024-06-25T00:20:51Z

But even if we have chopped off the target date, we would not want the forecast date to be the time value. We would want to first reattach the target date and then convert, if we were ever to put these in an epi_df at all; it seems a bit of a mismatch vs. a dedicated predictions format or archive. I think forecast_date should just be excluded from the considered set.

dsweber2 · 2024-06-25T15:38:16Z

If someone were trying to, say, smooth the scores, I could see using forecast_date on its own. This was actually the use case that got me started down the path of making this. It just seems very prescriptive to say that a user making an epi_df where there's nothing else that looks like a time value doesn't want time_value = forecast_date.

brookslogan · 2024-06-25T15:40:56Z

At the same time, we've had bugs from lining up forecast dates of forecasts with time values of signals, and a default that makes this easier seems undesirable.

brookslogan · 2024-06-25T15:46:37Z

This gets more into personal preference, but I'd actually probably prefer a prescriptive approach for data structures [or just column names, but we force people to use time_value regardless of what it actually represents, which is the opposite of what I'm imagining] and an optionally nonprescriptive interface for functionality, e.g. an option or function to slide by forecast date rather than time value.

dsweber2 · 2024-06-25T15:49:39Z

In the interest of shipping things rather than leaving every PR open, I've dropped it as a default; time_value = forecast_date will work, so the option isn't gone gone, just not a default. I'd appreciate a review and merge from someone in the not too distant future.

nmdefries

Brief review, will be more thorough in second pass. Please add some tests for the new functions.

R/archive.R

R/epi_df.R

R/utils.R

R/archive.R

dsweber2 · 2024-07-12T22:00:46Z

@nmdefries ok, it's actually fixed. It was all inane docs stuff

nmdefries

Overall looks good. My two big comments are

- Add classes to errors and warnings for easier and more robust testing. This could be turned into an issue and done later.
- Reshuffle logic in guess_column_name to read more clearly.

Other comments are nits/come down to your judgement.
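
As an illustration of the first point, a classed error lets a test or handler match on the condition class instead of the message text. This is a minimal sketch: `check_has_time_value()` and the class name `sketch__missing_time_value` are hypothetical and not taken from the package.

```r
library(cli)

# Hypothetical validator and class name, for illustration only.
check_has_time_value <- function(x) {
  if (!("time_value" %in% names(x))) {
    cli_abort(
      "There is no {.field time_value} column.",
      class = "sketch__missing_time_value"
    )
  }
  invisible(x)
}

# A handler keyed on the class keeps working if the message wording changes.
result <- tryCatch(
  check_has_time_value(data.frame(geo_value = "ca")),
  sketch__missing_time_value = function(cnd) "caught missing time_value"
)
result
```

In a testthat suite the same idea would be written with the `class` argument of `expect_error()`.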

tests/testthat/test-epi_df.R

tests/testthat/test-archive.R

R/archive.R

R/epi_df.R

R/utils.R

tests/testthat/test-archive.R

nmdefries · 2024-07-19T17:17:35Z

R/utils.R

+  if (!(column_name %in% names(x))) {
+    x <- tryCatch(x %>% rename(any_of(substitutions)),
+      error = function(cond) {
+        cli_abort("{names(x)[names(x) %in% substitutions]} are both/all valid substitutions.
+Either `rename` some yourself or drop some.")
+      }
+    )
+    # if none of the names are in substitutions, and `column_name` isn't a column, we're missing a relevant column
+    if (!any(names(x) %in% substitutions)) {
+      cli_abort("There is no {column_name} column or similar name. See e.g. [`time_column_name()`] for a complete list")
+    }
+    if (any(substitutions != "")) {
+      cli_inform("inferring {column_name} column.")


suggestion: I find the logic in this chunk confusing. Here's an alternative flow that reads more clearly to me.

Move the "if none of the names are in substitutions..." chunk before the tryCatch. Having the condition !any(names(x) %in% substitutions) after the tryCatch makes it read as if this block will always trigger. Downside to moving this before the tryCatch is that it is no longer checking if the rename was successful (although we shouldn't really need that).

If the rename has an error, the rest of the tryCatch expression won't be run, so I moved the cli_inform there.

Suggested change

-  if (!(column_name %in% names(x))) {
-    x <- tryCatch(x %>% rename(any_of(substitutions)),
-      error = function(cond) {
-        cli_abort("{names(x)[names(x) %in% substitutions]} are both/all valid substitutions.
-Either `rename` some yourself or drop some.")
-      }
-    )
-    # if none of the names are in substitutions, and `column_name` isn't a column, we're missing a relevant column
-    if (!any(names(x) %in% substitutions)) {
-      cli_abort("There is no {column_name} column or similar name. See e.g. [`time_column_name()`] for a complete list")
-    }
-    if (any(substitutions != "")) {
-      cli_inform("inferring {column_name} column.")
+  if (!(column_name %in% names(x))) {
+    # if none of the names are in substitutions, and `column_name` isn't a column, we're missing a relevant column
+    if (!any(names(x) %in% substitutions)) {
+      cli_abort("There is no {column_name} column or similar name. See e.g. [`time_column_name()`] for a complete list")
+    }
+    x <- tryCatch({
+      tmp <- x %>% rename(any_of(substitutions))
+      cli_inform("inferring {column_name} column.")
+      tmp
+    },
+    error = function(cond) {
+      cli_abort("{names(x)[names(x) %in% substitutions]} are both/all valid substitutions.
+Either `rename` some yourself or drop some.")
+    }
+    )

~~why the tmp?~~ actually read it XD
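
For reference, the suggested flow can be wrapped into a self-contained function and run on toy data. This is a sketch, not the merged code: `guess_column_name_sketch()`, its argument list, the shortened `substitutions` vector, and the trimmed error messages are all assumptions made for illustration.

```r
library(dplyr)
library(cli)

# Hypothetical stand-in for the function under review; not the merged code.
guess_column_name_sketch <- function(x, column_name, substitutions) {
  if (!(column_name %in% names(x))) {
    # No exact match and no known equivalent: there is nothing to infer from.
    if (!any(names(x) %in% substitutions)) {
      cli_abort("There is no {column_name} column or similar name.")
    }
    x <- tryCatch(
      {
        # `rename()` errors if several candidates map to the same new name.
        tmp <- x %>% rename(any_of(substitutions))
        cli_inform("inferring {column_name} column.")
        tmp
      },
      error = function(cond) {
        cli_abort(c(
          "{names(x)[names(x) %in% substitutions]} are both/all valid substitutions.",
          "i" = "Either `rename` some yourself or drop some."
        ))
      }
    )
  }
  x
}

# Shortened, hypothetical lookup.
substitutions <- c(time_value = "date", time_value = "target_date")

inferred <- guess_column_name_sketch(
  data.frame(date = as.Date("2024-07-19"), value = 1),
  "time_value", substitutions
)
names(inferred)
```

Passing a data frame that contains both `date` and `target_date` takes the error branch, and one with neither fails the earlier check, so the inform message only fires after a successful rename.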

dsweber2 self-assigned this Jun 20, 2024

dsweber2 mentioned this pull request Jun 25, 2024

as_epi_df convienence: combine separate day/month/year columns etc #476

Open

dsweber2 force-pushed the autoName branch from 08059b0 to e07ec94 Compare June 25, 2024 00:06

dsweber2 marked this pull request as ready for review June 25, 2024 00:07

dsweber2 requested review from brookslogan and nmdefries June 25, 2024 15:50

nmdefries reviewed Jul 3, 2024

R/archive.R (outdated)

R/epi_df.R

R/epi_df.R (outdated)

R/utils.R (outdated)

R/archive.R

dsweber2 force-pushed the autoName branch 2 times, most recently from f9499e6 to 243c20e Compare July 9, 2024 17:44

nmdefries self-requested a review July 10, 2024 20:05

dsweber2 force-pushed the autoName branch from 243c20e to 49af83a Compare July 11, 2024 16:31

nmdefries approved these changes Jul 19, 2024

dsweber2 and others added 6 commits July 19, 2024 13:25

basic auto-naming

4a6f7bf

geo_value and version, separate functions, more ex

284daaf

docs: document (GHA)

6b361da

errant renamed variables

3276e9e

More tests, ... tidyselect, doc as_epi_df, more values

00ce1da

docs: document (GHA)

287edd7

dsweber2 and others added 11 commits July 19, 2024 13:25

remove forecast_date as a default

75c62c2

happier linter

952e0e2

wrong test, better too many columns error

e3feab6

minor refactor to make col name subs accessible

ad70c69

Nat's suggestions

edbf1ac

avoid arg prefix-completion for versions_end

d19d1a5

style: styler (GHA)

f6904f2

docs: document (GHA)

76388e2

formatting

500a952

template needed changing

1b84d01

pkgdown fix for new functions, apparently

c59af46

dsweber2 force-pushed the autoName branch from cc54cd7 to c59af46 Compare July 19, 2024 18:25

dsweber2 added 2 commits July 19, 2024 15:21

recs from Nat, local checks passing

7d99e6d

desc, news

243c45e

dsweber2 merged commit 69ea5e4 into dev Jul 19, 2024
3 checks passed

dshemetov mentioned this pull request Jul 19, 2024

as_epi_df construction conviences #456

Closed

brookslogan mentioned this pull request Jul 23, 2024

[enh] promote other_keys #446

Closed

dshemetov deleted the autoName branch January 24, 2025 19:19
